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Abstract — inverse Multiplexing for ATM (IMA ) has been stan- 
dardized by the ATM Forum to provide a high-capacity logical 
link by grouping several lower-capacity physical links. IMA is 
intrinsically point-to-point, assumes congestion-free links, and 
only tolerates relatively constant differential delays between the 
links. In this paper, we generalize the notion of inverse multi- 
plexing by introducing switching and buffering within the inverse- 
multiplexed segment. Thi new scheme, which we call Switched 
Connection Inverse Multiplexing for ATM (SOMA) allows to 
implement an X x A* switch with ports operating at rate kR by 
usingak.X x kS core ATM switch with ports of rate R. 

SOMA executes demultiplexing on a selected set of virtual 
connections, dynamically distributes their traffic onto a group 
of switching paths, and performs egress re assembly using an 
asynchronous method for differential delay compensation. The 
scheme is simple to implement, has low overhead, is robust in 
presence of cell loss within the switch, handles congestion, and 
tolerates wide variations in differential delays. 
1 Introduction 

Inverse multiplexing has been known for quite some time [ 1 ) as 
a technique to implement a single high-capacity logical channel 
by using several lower-capacity channels, where the distribution 
and aggregation of the traffic flow to and from the individual 
channels is transparent to the higher layers |2, 3]. More re- 
cently, inverse multiplexing has been applied within the context 
of Asynchronous Transfer Mode (ATM) networks, and standard- 
ized by the ATM Forum in the Inverse Multiplexing for ATM 
(IMA) specification in July 1997 (4 J. This specification defines 
a new protocol-stack entity, called the IMA sublayer, which is 
located between the Physical Layer Transmission Convergence 
sublayer and the ATM layer, and performs "inverse multiplexing 
and demultiplexing of ATM cells in a cyclic fashion among links 
grouped to form a higher bandwidth logical link whose rate is 
approximately the sum of the link rates " 

The IMA protocol is inherently designed to provide a point- 
to-point logical link. As shown in Figure 1(a), the protocol 
treats the entire cell stream over the logical link as a single 
entity. The protocol segments the cell stream into frames of 
a specified duration, and sends the frames across a group of 
.V independent links by dispatching the cells to the individual 
links in a synchronous cyclic order, using special Filler cells to 
maintain a continuous cell stream at the Physical layer in the 
absence of ATM layer traffic. The frames arc reconstructed at^. 
the far end of the IMA logical link using alignment information 
contained in overhead IMA Control Protocol (1CP) cells, which 
are inserted once per frame on each link. Using the ICP cells, the 
IMA protocol can compensate for different end-to-end delays on 
each link, provided that such differential delays stay relatively 
constant over time; in this way. it is possible for the individual 
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Figure 1 : Inverse Multiplexing: (a) Conventional point-to-point 
IMA; (b) Proposed Switched Connection Inverse Multiplexing 
for ATM (SOMA). 


links to take different paths in the network, and even pass through 
different switches, as long as the end-to-end delay variation is 
controlled. 

In this paper, we generalize the notion of inverse multiplexing 
to the context of scaling the port capacity of an ATM switch by 
grouping a number of ports of a certain available capacity into 
a "logical" port of capacity approximately equal to the sum of 
the port capacities. As the market demands switchesjeapabje of 
terminating trunks of higher and higher bandwidth, the issue of 
scaling the port capacity becomes crucial. Inverse multiplexing 
can potentially provide an economical way to interface the new 
higher-rate trunks with existing switches having ports operating 
at lower rate. It can also become a flexible option in archiiecting 
new switches to push the port rate beyond what is possible or 
cost effective in terms of port processing and port rates in the 
switching fabric. However, due to its intrinsic point-to-point na- 
ture and rather rigid handling of differential delays, conventional 
IMA is not suitable for this purpose. 

We introduce a more general inverse multiplexing scheme, 
called the Switched Connection Inverse Multiplexing for ATM 
(SCIMA), which removes the limitations of IMA. The context 
where SCIMA operates is illustrated in Figure Kb). A core 
switch with input and output ports of a given capacity R tcrmi- 


0-7803-4984-9/98/S 10.00 ©1998 IEEE. 


SOOCID: <XP 801610A_L* 


3135 


nates high-capacity transmission trunks. This is accomplished by 
grouping the input and output pons of the core switch, and asso- 
ciating each group with a controller connected to a high-capacity 
trunk. The SC1MA scheme operates in the input controllers to 
distribute the traffic arriving on the incoming trunk among the 
switching paths originating at that logical port and in the output 
controllers to perform aggregation of the traffic onto the outgoing 
trunk. For example, using SCIMA. currently available OC-12 
switch pons [5] can be grouped together to interface to emerging 
OC-48 and OC-192 trunks. 

SCIMA is a significant generalization of conventional IMA 
in three main aspects. First, SCIMA demultiplexes and aggre- 
gates the traffic separately for each Virtual Connection (VQ that 
constitutes the high-rate trunk traffic flow, rather than the entire 
high-rate stream as a whole; indeed, since SCIMA operates at 
the edges of a core switch, each VC has to be switched indi- 
vidually, based on its dest ination, allocated rate, traffic class. 
a nd other Quality of Service ( QaS) requirgmenK a$ pppn^ri 
to the point-to-point environment where IMA operates. Sec-, 
ond, SCIMA handles congestion and the possibility of cell loss; 
in fact, the switching paths within the switch contain buffers, 
shared by . independent bursty traffic flows, and may experience 
at times heavy congestion, contrary to the effectively buffcrless 
and congestion-free links in DMA. Finally, SCIMA deals with 
large, unpredictable differential delays as well as significant de- 
lay variations, caused by the rapidly varying cross-traffic and con- 
gestion conditions existing along the different switching paths; 
this requires an asynchronous scheme fo r cell re-ordering and 
differential delay compensation, as opposed to the synchronous 
method used in IMA. 

Our SCIMA scheme assumes a very general architecture for 
the core switch, and is therefore widely applicable. The scheme 
is simple to implement and requires very low ovgfceajj. To 
keep complexity low, SCIMA performs traffic demultiplexing 
(which we refer to as splitting) of only a selected set of VC Y 
The cell ordering i nformation .c onsists of a small number of 
bits, and is transmitted in the local switching header of each 
cell, in order to support the asynchronous approach of the novel 
Cell Ordering and Re -Assembly Protocol that we use in SCTM A. 
The per- VC traffic s plitting allows to transparently pass the QoS — 
guarantees provisioned by the core switch onto the high-rate 
trunk. SCIMA relies on the ability of the switch to execute a 
connection admission policy which, along with accepting a VC, 
makes an assignment of a split weight vector which specifies how 
the traffic for that VC should be split on the available switching 
paths. 

We provide a number of mechanisms to make SCIMA robust 
against cell losses in the core switch. First, we use backpressure 
to allow persistent cell losses (such as those due to buffer over- 
flow) only at the input controllers and at the output pons of the 
core switch, where it is easy to guarantee that they do not affect 
the Te-assemblyj)roccss. Second, we deyisc a cel l r e-orderin g 
scheme that is ro bust tojsolated losses (such as those caused 
by electrical glitches and sporadic errors) occurring at any point 
in the switch. Third, we provide an Advanced Frame Recovery 
Mode to promptly re- establish correct cell reorde ring in those 
situations where, despite the two previous mechanisms, a Joss 
pattern that cannot be handled by the re-assembly process has 
occurred. Finally, SCIMA copes with congestion in the differ- 
ent paths by using an Adaptive Cell Dispatching Algorithm that 
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Figure 2: SCIMA Architecture 


performs dynamic load balancing and avoids congested paths, as 
made possible by the asynchronous nature of the scheme. 

The issue of cell rc-assembly at the output of the switch that 
arises with inverse multiplexing in an ATM switch has been ad- 
dressed in the context of multi-path switch architectures. How- 
ever, the exis ting solutions are based either on sending tftquengg 
n umbers ui the cell headers, whic h is costly in terms of over- 
head [6, 71, or on an enforced delay equalization for specific 
input-output port pairs (8, 9] which effectively precludes QoS 
provisioning to the individual VCs in the inverse-multiplexed 
flow. 

The rest of this paper is organized as follows. In Section 2, we 
define the underlying switch architecture assumed by SCIMA. 
In Section 3, we motivate the per- VC traffic splitting approach, 
provide an overview of the SCIMA scheme, and discuss our strat- 
egy to handle cell loss. The individual functional components of 
SCIMA are discussed in Sections 4 through 6. Finally, Section 7 
provides concluding remarks. 
2 Model 

We begin with the description of the underlying switch archi- 
tecture. As shown in Fig. 2, the A* A' x A A* core ATM switch 
has a very general muustage architecture, which consists of A.Y 
input modules and A\Y output modules communicating with each 
other by means of an interconnection network. Buffers may be 
provided in both input and output modules as well as in the 
center-stage interconnection network. 

The ports of the core switch, each of capacity R bits/sec, are 
partitioned into groups of size A\ Each group is connected to a 
single inverse multiplexing controller handling a high-rate trunk 
of capacity A /?. The core switch together with the input and 
output controllers form therefore an A' x A' aggregate switch 
with pons of capacity kR bits/sec. Each input controller acts 
as a 1 x A* demultiplexer, and distributes the traffic from the 
high-rate input trunk to the pons in its corresponding group. 
The demultiplexed traffic flows in each input port are treated 
independently by the core switch and follow separate switching 
paths within the switch to the destination egress ports. The output 
controller acts as a A- x ] multiplexor, and consolidates the traffic 
arriving from different egress ports in the same group, while 
maintaining cell order in the aggregated flow. 

Control over buffering and cell loss within the aggregate 
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switch is attained through the use of backpressure, which may be 
applied at five different locations, as discussed in the next section: 
from the output controllers to the core switch output modules, 
from the output modules to the output of the interconnection 
network, from the outputs to the inputs of the interconnection 
network, from the interconnection network lo the input modules; 
and from the input modules to the input controllers. 
3 SCIMA Overview 
3.1 Splitting of Virtual Connections 

The obvious primary objective of SCIMA is to deliver the 
traffic of each VC arriving at the inputs of the aggregate switch to 
its destination, while maintaing the cell order for that VC. When 
the required bandwidth of the VC is higher than the capacity of 
the ports in the core switch, inverse multiplexing of the traffic 
of that VC across the core switch must be performed. We refer 
u> this operation as splitting the VC, meaning that the traffic 
flow for that VC is split at the input controllers into independent 
subflows (which we call subconnections), each sent to a different 
input pott of the core switch and treated independently by the 
core switch; the subconnections are reassembled at the output 
controllers into an aggregate flow that maintains the cell order of 
the original flow. 

The alternative approach to splitting at the VC level would be 
splitting the entire flow associated with each pair of input and out- 
put ports in the aggregate switch among the available switching 
paths. Although this approach can achieve almost ideal pooling 
of the switching capacity and load balancing over the switching 
paths, it cannot provide differentiated Quality of Service to traf- 
fic classes and individual VCs within a single flow. Splitting at 
the VC level makes it possible to transfer transparently the QoS 
properties of the core switch to the aggregate switch. 

While it is clear that VCs of bandwidth higher than the ca- 
pacity R of the pons in the core switch have to be split, it may be 
desirable to also split VC's of bandwidth lower than R in order 
to ensure l oad distribut ion and balancing within the set of the 
switching paths which exist between each input and output pair 
in the aggregate switch. The trade-off here is between increased 
utilization and reduced call blocking probability, on one hand, 
and complexity and overhead to re-assemble the split VC's, on 
the other. We have observed that a relatively small number of 
split VC's is sufficient to guarantee good utilization and blocking 
probability in most circumstances (a complete performance char- 
acterization of the aggregate switch as a function of the number 
of split VC's is the topic of a forthcoming paper). In order to 
keep implementation complexity and overhead low, SCIMA uses 
this observation and restricts splitting to a selected set of VC's, 
while routing the others through individual switching paths. In 
other words, the number of admitted split VC's is treated as a 
special kind of manageable resource (similar to line bandwidth 
and buffer space), which has to be taken into consideration by 
the local Connection Admission Control (CAC) function. 

In summary. SCIMA provides the capability of splitting a 
(relatively small) number S of VC's; splitting of a VC that is 
admitted by CAC is performed in the following circumstances: 

• If the bandwidth B requested by the VC exceeds the port rate 
R of the core switch (R < B < kR) t then the VC is split. 
There may be at most k - 1 such VC's that can be admitted 
(again, this is the minimum number of VC's that SCIMA must 
be able to split; k - I of the 5 VCs that SCIMA can split are 
always reserved for this purpose). 


• If the bandwidth B requested by the VC is lower than /?. but 
the residual switching capacity of any available switching path 
is not sufficient to accommodate the VC, while the request 
can be granted if the residual capacities of several paths arc 
pooled, the CAC may decide to accept the VC by splitting it. 

♦ If the bandwidth B requested by the VC is lower than R 
and can be accommodated by a single switching path, but 
admission without splitting is likely lo create a highly uneven 
load distribution among the different switching paths, thus 
increasing the differential delay for the existing VC's. the 
CAC may decide to split that VC. 

3.2 Load Distribution 

The distribution of load within the aggregate switch is per- 
formed both statically and dynamically by SCIMA. Sialic load 
distribution occurs at the connection admission stage. 

Let C r be a connection request arriving at the high-rate input 
trunk » which has to be routed to the high-rate output trunk rf. 
The CAC strategy (the details of which are outside the scope of 
this paper) is based on the currently allocated resources and the 
traffic descriptor of C, . If the admission decision is positive, a 
split weight vector 

ir<-> = «•!;»). 

such that jy!m\ u ! r| = is assi 8 ned 10 ^' r; " is lhe num ° er °f 
available switching paths between the a-th ingress port group and 
the t/-th egress port group. (If CV is admitted without splitting 
and routed through a single switching path t, then w\ n = 1.) 
In order to achieve a stricUy-non-blocking aggregate switch, » 
should be equal to k 1 ; however, in practice, we have observed 
that imposing lhe restriction of a single switching path per port, 
when used with proper CAC, does not severely compromise the 
blocking properties. Therefore, it is often sufficient to assume 
ii = k. 

SCIMA distributes the traffic of each split VC to the switch- 
ing paths in proportion to the specified split weights. In addi- 
tion, to handle congestion, SCIMA also performs dynamic load 
re-distribution in the course of regular switch operation, using 
backpressure information from the input modules in the core 
switch, as discussed in Section 6. Due to the asynchronous ap- 
proach in the cell re-assembly protocol, SCIMA can avoid to 
send traffic to a path that is experiencing congestion, and redirect 
it to another path. 

3 3 Confronting Cell Loss 

In this subsection, we outline the strategy that we use in 
SCIMA to handle cell loss in the switch. 

For our purpose, we categorize cell loss into two classes: pre- 
ventable and inevitable. The p reventable loss es are associated 
with a specific switch functionality and can, at least in princi- 
ple, be avoided by pro per traffic engineeri ng, control policy, or 
switch architecture. This cTa ssJs jre prescntcd, for example, by 
the cell loss connected lo buffer overflow or usage parameter 
control (policing). These losses may occur in isolation or in 
bursts affecting many consecutive cells. The inevitable josses 
are caused by uncontrollable stochastic external factors. These 
are exemplified by r andom electrical glitches leading to cell de- 
lineation errors, header corruption, mis-routing, etc. Typically, 
most of the inevitable losses are isolated, i.e., only a single cell 
is affected within a relatively large time interval. Occasionally, 
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such a type of loss may be persistent and afTcct more than one 
cell in a row. 

The strategy that we use in SCIM A aims at making the scheme 
robust against all types of cell loss while keeping the over- 
head low. We achieve this objective by restricting where the 
preventable losses occur for the split connections, making the 
scheme able to handle single losses at all points in the switch, 
and providing a way to recover from the occasional inevitable 
losses that are persistent 

As far as preventable losses are concerned, we can control 
where they occur through the use of backpressure. Specifically, 
we restrict the preventable losses to occur only where it is simple 
for SCIMA to detect and recover from them, namely, in the 
input and output controllers and in the output ports of the core 
switch. We guarantee that no cell is lost in the center-stage 
interconnection network by applying backpressure to the input 
modules of the core switch. Similarly, we guarantee that no cell 
for the split connections is lost in the input modules by applying 
backpressure (selectively for the split connections) to the input 
controllers. In addition, again only for the split connections, 
such functions as policing and Early/Partial Packet Discard are 
disabled or transferred to the input controllers. We allow losses 
for the split connections at the output pons of the core switch. 
In this case, it is relatively simple to detect a loss and report it 
to the output controller. The loss report contains the SCIMA 
protocol portion of the discarded cell's local header (described 
in Section 4 below), so that the egress re-assembly can proceed 
uninterrupted even when the cell itself has been dropped. 

Contrary to preventable losses, the locations in the switch 
where inevitable losses may occur cannot be controlled. How- 
ever, as mentioned above, such inevitable losses occur with high 
probability in isolation. For this reason, we have made SOMA 
robust to isolated losses anywhere in the switch. 

Obviously, persistent inevitable losses, although not very 
likely, cannot be excluded. Consequently, we have made SCIMA 
capable of detecting a persistent loss, up to a certain maximum 
duration (the details are discussed in Section 4 below). When the 
scheme detects a persistent loss, c ell re-orderingis mo mentarily 
suspended, and an Advanced Frame Recovery Mode is invoked 
to resynchronize the re-assembly process, lfthe duration j>fjhe 
persistent loss is larger than the maximum duration recognized 
by SCIMA, it is possible that the scheme cannot detect the loss 
or that it cannot resynchronize. However, in a practical switch, 
other means are typically provided to delect very long persistent 
inevitable losses, such as those due to permanent faults, and prop-, 
erly communicate the event to the various hardware components 
in the switch. 

3 A Functional Components of SCIMA 

Once a split VC is admitted, the Cell Ordering and Re- 
assembly Protocol is instantiated for that VC om the correspond- 
ing pair of input and output controllers. This protocol, described 
in Section 4 below, is responsible for the re-assembly of the VC 
flow by exchanging cell ordering information, which is passed 
transparently from the ingressjo the egress of the switch in the 
cell's local routing header. The protocol is consistent with the 
cell loss handling strategy described above, and is therefore ro- 
bust against isolated losses, as well as able to detect persistent 
losses. When a persistent loss is detected, the Advanced Frame 
Recovery Mode described in Section 5 below is invoked. 

Finally, each input SCIMA controller performs the slot-by- 
slot cell dispatching of the admitted ATM VCs, both split and 


unspliu to the switching paths of the associated core switch port 
group. The execution of this task is ensured by the Adaptive 
Cell Dispatch Algorithm described in Section 6, which is able to 
dynamically redirect a split flow destined to a congested path to 
a different path. 

4 Cell Ordering and Re-assembly Protocol 

4.1 Problems with Re-Ordering 

Before describing the protocol itself, we identify some diffi- 
culties involved in designing a suitable split-connection cell re- 
ordering scheme. Here, we are interested in devising a scheme 
that requires low overhead and is robust to isolated losses. In 
addition, the scheme must work properly regardless of how the 
cells of a split VC are actually distributed among the possible 
paths, and not rely on all the paths to be used at a given time. 
This is necessary for our dispatching algorithm to be able to 
redirect flows around paths that are experiencing congestion. 

A solution that has been proposed in the past for cell re- 
assembly (6, 7] is to transmit a sequence number in the cells* 
local header. The output controller then buffers the cells of each 
subconnection in a First-In-First-Out (FIFO) queue, and recon- 
structs the flow by serving the cells out of the subconnection 
FIFO queues by increasing order of their sequence number 

The problem with this solution is that, in order for it to work 
properly, the period with which the sequence number repeats (the 
sequence number consists of a finite number of bits in the local 
header of the cells) must be longer than the largest number of 
cells that can simultaneously be transiting the switch on different 
subconnections. This implies that the differential delays between 
subconnections be bounded and, for the overhead to be low, that 
such a bound be relatively small. Unfortunately, this is in general 
not the case in current switches, where the buffer capacities from 
an input to an output may be of the order of several million cells, 
and where cells are not necessarily served in a FIFO order but 
according to more complex QoS scheduling. The problem caused 
by a period of the sequence number that has insufficient length 
is that more than a single cell with the same sequence number 
can simultaneously be in transmission, leading to the possible 
scenario illustrated in Fig. 3: when, due to congestion, the cell 
dispatch procedure temporarily skips some of the switching paths 
(path PO in the example), the cells at the head of two or more 
subconnections FIFO queues in the re -assembly buffers may have 
conflicting sequence numbers. 
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Figure 3: Sequence numbers causing re-assembly failure. 

Complementing the total order sequence number with a 
PREV .PATH pointer which identifies the switching path to which 
the preceding cell of the same split VC was dispatched does not 
present an improvement For example, in the scenario of pig. 4, 
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the same two cells which caused confusion in the previous ex- 
ample would still remain indistinguishable. 
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Figure 4: PREVJWH pointer in combination with sequence 
numbers causes re-assembly failure. 

Transmission of a per-VC sequence number along with a 
NEXT-PATH pointer, which ideniifies the subsequent switch- 
ing path, allows to remove ambiguity in the re-assembly process. 
However, when even a single cell is lost in transmission, this 
approach may fail to detect it Such a scenario, which involves 
two cells with identical sequence numbers being transmitted over 
the same switching path, is shown in Fig. 5. 
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Figure 5: NEXT-PATH pointer in combination with sequence 
numbers let a cell loss go undetected. 

42 Description of the Protocol 

Our Cell Ordering and Re-assembly Protocol takes advan- 
tage of the NEXT-PATH pointers combined with the use of per- 
subconnection (rather than per-VC) sequence numbers. In addi- 
tion, to achieve robustness in the event of a single cell loss, each 
NEXT-PATH value is transmitted twice: with its proper cell and 
in the header of the subsequent cell of the same subconnection. 
Figure 6 shows the formal of the SOMA portion of local rout- 
ing header. A" is the number of switching paths in a group, S 
the number of simultaneously supported split VC's, and M the 
sequence number modulus. 
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Figure 6: SOMA protocol fields of the cell's local header. 

The SiJilJD field is actually not part of the SOMA over- 
head, since it is accommodated in the connection identifier field 


that is anyway present in the local switching header. The se- 
quence number modulus M depends only on the duration of the 
persistent inevitable losses that SCIM A needs to detect. As men- 
tioned above, since in a practical switch other means arc provided 
to detect very long persistent inevitable losses, a small value of 
M is typically sufficient. 

The input SCIM A controller maintains the following data for 
each split VC: 

• a pointer P„.* t to the switching path scheduled for the next 
cell transmission; 

• a vector A'.. „w[l : A*] indicating the sequence number of the 
next cell for each of the 1\ subconnections; 

• a vector /V'<-[1 : A*] of the P n * TI pointers that were transmii- 
ted with the most recent ceil of each of the A" subconnections; 

Two registers CRXT and PRED are used for the temporal 
storage of the current switching path and the predicted switching 
path, respectively. 

On admission of a split connection, Pneu is initialized to a 
pre-negouated value. ;Y,. orf [l : K] arc all zeros, and P t , r ., [1 : 
A" J are irrelevant. Subsequently, when a cell belonging to the 
given split VC is being transmitted, the input controller takes the 
following steps: 

(11) copies the current value of Pn e xi to register CRXT and 
uses the cell dispatch algorithm to identify the switching 
path to be used for the transmission of the next cell of the 
same VC, storing this value in PRED ; 

(12) modifies the information fields of the cell's local routing 
header (refer to Fig. 6) as follows: 

StqJk <- X,<»<i[CRXT); 
Xtrt.Path <- PRED; 
Pr«v-S*TtJ>ath <- P pr , t [CRXT); 

(13) updates the stored P r ,, v vector 

P pr *v[CRXT)i- PRED; 
04) increments the sequence number of the switching path in 
use: 

;V Mm ,[ CRXT ] 4- ( X t end[ CRXT] + 1 ) mod M; 

(15) stores the identity of the next path to be used for the given 
split VC: 

P m „ <- PRED; 

(16) dispatches the cell to the switching path identified by the 
register CRXT , 

On the receiving side, the output SOMA controller maintains, 
also on a per split connection basis, the following data structures: 

• asetQf/ro[l : A"] of re-assembly buffers with FIFO service 
discipline; 

• a pointer P. to the switching path on which the next cell of 
the given split VC is expected; 

• a vector X r ft j[! : A"] indicating the expected sequence num- 
bers of the cells from the respective subconnections. 

On connection admission. P. ir is initialized to the same prc- 
negotiated value as P n . j t at the ingress. The sequence numbers 

.Y, ,,.,,[ 1 : M 851 10 zm - Wncn a cel * arri vcs <> n a 
switching path p, it is enqueued in Qnr\>\)>). 

The output SOMA controller performs the following steps: 

(01) periodically checks the re-assembly FIFO indexed by P.. A} . 
and. when it becomes occupied, inspects the sequence num- 
ber f/_Yo of its first cell; 
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Figure 7: Pcr-subconnection sequence numbers in combination 
with NEXT-PATH pointer of the current and the previous cells 
of the same subconneciion. 


(02) if Seq^Xo matches .Vr C ».<f[^**p]. ^en the expected se- 
quence number and switching path index are updated as: 

AVcrd(P« P ] <- (Sfq m .\'o + 1| mod.W; 
Ptir <- AVxr JPath ; 
the cell is dequeued, and step (01 ) is repeated; 

(03) if ( St q.\o = ;V rei . rf [P, X| J + 1 ) mod then a cell loss is 
declared and the cell is left at the head of the FIFO queue. 
The data structures are updated: 

S'rtrd[P,s P ] <- (Sev-Ab+ 1 ) mod ,1/. 
P«x P PrtvJfeTtJ>ath . 
and the control is returned to step (OI ); 

(04) if SeqJko differs from the stored expected value by more 
than 1 , a severe cell loss is declared leading to the invocation 
of the Advanced Frame Recovery Mode. 

Figure 7 provides an example of the protocol operation. The 
solid arrows point to the next switching path to be used, whereas 
the dash arrows indicate the retransmitted pointer values. A 
unit gap in the pcr-subconnection sequence numbers, as is the 
case with switching path P3, allows to detect a single cell toss 
event. Although the next switching path pointer of the lost cell 
(dotted arrow) is missing, the split VC cell chain can be restored 
using the pointer value retransmitted with the next cell of the 
subconneciion (wide dash arrow). 

We observe that the use of the NEXT-PATH pointer requires 
the dispatching to be predictive, i.e., the input controller has to 
make the selection for the path of the next cell routing prior to the 
departure of the previous cell of the same subconnecuon, even if 
the next cell itself has not yet arrived. 

4 3 Properties of the Protocol 

In summary: 

• If no losses occur and the differential delays between subcon- 
nections is finite, the protocol rc-assembles cells in the same 
order as they were dispatched. Unambiguous re -assembly is 
ensured. 

• The protocol automatically recovers the cell re-assembly 
chain if not more than one cell from the same subconnec- 
iion is lost in a row. 

• The protocol detects an occurrence of cell loss within a sub- 
connecuon, if the number L of consecutively lost cells satisfies 
I < LmodM < .W- \;\(L is a multiple of. V, the protocol 
fails to detect the cell loss. 


• The protocol recovers incorrectly if the number L of con- 
secutively lost cells in a subconnecuon exceeds M and 
Lmod M = 1. 

Finally, we observe that, in general, depending upon the avail- 
ability of the control bandwidth, the recovery from L t consec- 
utive losses [L, < .U) can be achieved by transmitting, in the 
local header of each cell, the L r most recent P n , Jt pointers of 
the same subconnecuon. 
5 Advanced Frame Recovery Mode 

The Advanced Frame Recovery Mode is a robust extension of 
the Cell Ordering and Rc-assembly Protocol based on the regu- 
lar insertion of sequentially numbered checkpoints into the cell 
stream of the split VCs. The checkpoints partition each VC cell 
stream into frames. A portion of a frame dispatched via the same 
switching path is referred to as a sub frame. The switching path 
(and the corresponding subconneciion) that carries the first cell 
of a frame, becomes the leader of that frame. Since the cells of 
each split VC are dispatched in FIFO order, the cells belonging 
to different subframes can not interleave within a subconnecuon. 
Therefore, if the re-assembly algorithm aligns each subconnec- 
uon at a subframe corresponding to the same checkpoint and 
correctly selects the leader, then the cell sequence can be re- 
stored from that point on. Regularity of checkpoint insertion 
does not necessarily imply periodicity; instead the checkpoints 
may follow, for example, the boundaries of the transport layer 
protocol frames. 

The checkpoints are implemented with a painted header, a 
special form of the local routing header which is assigned to 
the first cell of each non-empty subframe. The painted header 
contains the frame sequence number and the leader flag (alter- 
natively, the leader flag can be eliminated by always sending the 
leader on one pre-negotiatcd path). The length of the frame se- 
quence number field depends on the maximum number of frames 
that SOMA must tolerate to loose, and is typically small. With 
careful design, it is possible to superimpose the entire painted 
header on existing fields in the local header, thus not adding ad- 
ditional overhead. When the Advanced Frame Recovery Mode 
is invoked on a split VC, the output SOMA controller repeatedly 
flushes the subconneciion re-assembly queues Qriro[ I : A*] 
until a painted header cell is encountered. Flushing is repeated 
unul all A' queues are aligned at the beginning of the same frame, 
or are empty, and exactly one leader is found among the painted 
headers. At this point P <xp is reset and the sequence numbers 
AVr. j[\ : A*] are re-initialized. Loss of a painted header cell is 
not catastrophic as the controller will merely try to align all the 
subconnections at the beginning of the next frame. 
6 Adaptive Cell Dispatch Algorithm 

Finally, we describe the algorithm which dispatches the cells 
to the individual paths and performs dynamic load balancing 
according to the congestion state of the paths. The algorithm runs 
in the input controllers and serves two purposes: first, for every 
split VC, it predict! vely computes the sequence of switching paths 
to which the cells have to be routed; second, for every port in the 
core switch, it performs scheduling between the subconnections 
and the unsplit portion of the traffic destined to that port. 

For the split connections CV. r = 1 5. the algorithm 

distributes the traffic load in proportion to the assigned split 
weights U" ,r \ irrespectively of the cell arrival process. It also 
accounts for the backpressure information provided by the input 
ports of the core switch, selectively for each subconnection. As 
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mentioned above* the backpressure mechanism is not intended 
to decrease the losses for the split VC's. but rather to transfer the 
losses to the input controllers, before any dispatching decision 
has been made. The unsplit VC's. which are not subject to 
backpressure, may still experience congestion and loss in the 
input pons of the core switch. 

The actual dispatching mechanism consists of a two-level 
Weighted-Round-Robin (WRR) scheduler, whose structure is 
schematically shown in Fig. 8 (for simplicity, as discussed in 
Section 4 above, we assume that a split connection is always 
associated with a single switching path per port): 

• A first level WRR scheduler is instantiated for every split con- 
nection CV using the split weight vector U' 1 ' 1 , to dispatch the 
cells of the VC to the switch pons in the group in proportion 
to the split weights. For each split VC, this stage consists of 
a single pre-split FIFO queue to store the cells before they 
are committed to a particular port in the core switch, and one 
subconnection buffer to store cells immediately before they 
are sent to a selected port Since it is desirable to commit a 
cell to a specific port as late as possible, each subconnection 
buffer typically only needs to store a single cell. 

In order to be consistent with our cell ordering and re- 
assembly scheme described above, the first level scheduler 
actually acts on the next pointer P ntl t rather than on the cell 
itself (in other words, commits the next cell rather than the 
current cell). In selecting where the next cell is going to be 
dispatched, the scheduler takes into account the backpressure 
information for that VC, and does not send cells to paths ex- 
periencing congestion, redirecting that traffic to other paths. 
Since the scheduler is actually scheduling "next cells", the 
effect is that of an increased latency in backpressure having 
effect on the traffic, and therefore the core switch needs to re- 
serve one cell buffer space per subconnecuon to absorb such 
latency. 

• A second level WRR scheduler is associated with each port i 
of the core switch attached to the input controller, and arbi- 
trates among all the unsplit VC's and the split subconnecuons 
destined to port i. In this scheduler, the subconnecuons cor- 
responding to a split connection CV are served with weights 
w\ r) x £ M t where w\ n is the split weight of the subcon- 
nection and B lr) is the requested bandwidth of connection 
C r . In addition, the scheduler serves a single unsplit FIFO 
port queue where all the cells of the unsplit VC's destined to 
that port are directly queued on arrival. The weight of this 
queue is equal to the sum of the bandwidths of the unsplit 
VC's destined to port r. This second level of the scheduler is 
not affected by backpressure information. 

In practice, this hierarchical WRR mechanism can be modified 
to synchronize the two levels of the scheduler. 
7 Conclusions 

In this paper, we have generalized the concept of inverse multi- 
plexing to the context of switched ATM connections. The scheme 
allows to use a fr.Y x k X ATM switch with pons operating at 
line rate ft to implement an .V x .V switch with pons operating 
at line rate kR. This avoids large re-design costs of the switch- 
ing equipment when interfacing high-capacity trunks to existing 
switch cores having lower-rate ports. 

Whereas the conventional Inverse Multiplexing for ATM 
(IMA) performs inverse multiplexing of the entire ATM traffic 
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Figure 8: Input SCIMA controller two-level cell dispatch mech- 
anism. 

stream as a whole, using synchronous techniques for differential 
delay compensation, and is therefore suited for bufferless and 
lossless point-to-point transmission links, the Switched Connec- 
tion Inverse Multiplexing for ATM (SCIMA) presented in this 
paper is specifically designed to operate on a pcr-VC basis in 
presence of cell loss, congestion, and large unpredictable differ- 
ential delays. We have designed a set of supporting protocols, 
which includes split connection cell ordering and re-asscmbly, 
adaptive cell dispatching with prediction, and advanced frame 
recovery in the presence of severe faults. 
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